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Abstract 

Background: Large-scale protein structure alignment, an indispensable tool to structural bioinformatics, poses a 
tremendous challenge on computational resources. To ensure structure alignment accuracy and efficiency, efforts 
have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are 
costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of 
structure fragments for structure comparisons. 

Findings: We present ppsAlign, a 2arallel 2rctein structure Align ment framework designed and optimized to 
exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take 
many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated 
ppsAlign on an NVIDIA Tesia C2050 GPU card, and compared it with existing software solutions running on an 
AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40- 
fold speedup over MAMMOTH. 

Conclusions: ppsAlign is a high-performance protein structure alignment tool designed to tackle the 
computational complexity issues from protein structural data. The solution presented in this paper allows large- 
scale structure comparisons to be performed using massive parallel computing power of GPU. 



Background 

Large-scale protein structure comparison is becoming a 
more and more important approach to providing a bet- 
ter picture for understanding biological systems [1,2]. 
Given a database of protein structures, the main goal is 
either to find proteins that are structurally similar to a 
given protein (i.e., one-against-all comparison) or to 
build various connectivity among proteins by performing 
exhaustive comparisons on the whole database (i.e., all- 
against-all comparison). The results of structural com- 
parison are useful in discovering potential structural, 
evolutionary, and functional relationships among these 
proteins and have significant impact on structure-based 
drug design [3], protein-protein docking [4], and other 
biological findings [5]. Recently, the dramatic increase in 
protein structural data [6] has led to an ever increasing 
demand for structure alignment tools that can not only 
find accurate alignments at residue level but also 
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complete large-scale structure comparisons in a reason- 
able time. 

Several approaches have been developed to address 
the limitations of traditional alignment methods and 
tackle the computational issues. The traditional align- 
ment methods [1,2,7], such as DALI [8], CE [9], TM- 
align [10], Fr-TM-align [11], and MAMMOTH [12], are 
based on the comparison of residues or fragments to 
build initial alignments which are optimized by various 
procedures, such as Monte-Carlo, combinational search, 
and dynamic programming. These methods can provide 
accurate alignments at the residue level but are usually 
computationally expensive, which makes them infeasible 
in coping with very large datasets. To accelerate this 
process, one approach is to map the protein structures 
into ID sequences and then use various sequence align- 
ment methods to align two structures [13,14]. Another 
approach [15] utilizes a "bag of words" method, which 
depends on frequency of specific structural patterns, to 
provide speedy structure match and filtering. These 
approaches significantly improve efficiency for large 
datasets; however, this is often achieved at the cost of 
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loss of topological details, which could lead to lower 
accuracy than the traditional structural comparison 
methods or could be unsuitable to perform residue-level 
alignment. Another approach is to parallelize traditional 
algorithms using a cluster or grid environment consist- 
ing of thousands of computing nodes [16,17]. These 
approaches can fulfill the desires of efficiency and accu- 
racy but require high-performance computing environ- 
ments which are energy-consuming and may not be 
accessible to the biologists. 

With the increase in performance and programmabi- 
lity of many-core Graphic Processing Units (GPUs), 
more and more bioinformatics applications have been 
deployed on GPUs and have shown promising results in 
terms of speedup over their conventional CPU imple- 
mentations. Liu et al. [18] implemented a GPU-based 
Smith-Waterman algorithm [19] for pair-wise DNA 
sequence alignment. Later, the efficiency of sequence 
alignments has been continuously improved in [20-23]. 
Vouzis and Sahinidis developed GPU-BLAST (Basic 
Local Alignment Search Tool) [24] to accelerate NCBI- 
BLAST [25]. Hung et al. developed a method for calcu- 
lating RMSD (Root Mean Square Deviation) after super- 
position for ATI GPU card [26]. Stivala et al. utilized 
simulated annealing (SA) to develop a protein substruc- 
ture searching algorithm, SA Tableau Search, to find 
structural motif at level of secondary structure element 
(SSE) [27]. It is worth mentioning that from the litera- 
ture the SA Tableau Search is the first attempt to apply 
GPU in protein structure comparison at the SSE level. 
Other applications include protein-protein docking [28] 
and statistical phylogenetics [29]. 

In this paper, we present ppsAlign, a parallel protein 
structure Alignment framework which is designed and 
optimized to exhaustively exploit the parallelism of the 
GPU architecture for residue-level structure compari- 
sons. Our experimental results (reported on a NVIDIA 
Tesla C2050 GPU card) show that ppsAlign significantly 
outperforms existing structural alignment tools in com- 
putational efficiency. 

We believe that GPU's massive parallel computing 
power can unlock the door to a cost-effective and high- 
performance computing environment that can be bene- 
ficial to the structural biology community. 

Findings 

Overview 

The framework of ppsAlign is shown in Figure 1. The 
inputs include a target protein and a protein database A 
= Pn}' The outputs are structure alignments 

between the target protein and each database protein. 
The online alignment starts with a generation of some 
initial sets of matched fragments and corresponding 
alignments. Then, the initial alignments are extended 



and refined using Dynamic Programming to obtain the 
final results. Specifically, the ppsAlign algorithm consists 
of 5 steps: 1) Index-based matched fragment set (MPS) 
search is utilized to find the maximal Nseed seed MPS' 
between the target protein and each database protein; 2) 
Fragment-level alignment is used to assemble the MPS' 
and generate initial alignments; 3) Residue-level align- 
ment is used to refine the initial alignments to residue 
alignments; 4) Maximal alignment search is used to find 
a transformation that can best superimpose the entire 
target protein over each database protein based on the 
obtained residue alignments; 5) Final assessment is per- 
formed to calculate z-Score and evaluate statistical sig- 
nificance of alignments. Steps 1) and 5) are executed on 
the CPU core, while steps 2) ~ 4), the most time-con- 
suming parts of ppsAlign, are implemented as GPU ker- 
nels and iteratively executed on GPU for Niter times. 
The GPU kernels are developed using CUDA (Compute 
Unified Device Architecture) programming model [30]. 
During the alignment, the protein structures and inter- 
mediate results from each GPU kernel are stored in 
CPU's on-board memory, such as read-only constant 
memory, read-only texture memory, and read-write glo- 
bal memory. Generally, the constant and texture mem- 
ory have limited capacity but high access rate compared 
to the global memory. For an overview of GPU architec- 
ture and CUDA model, readers are referred to [30,31]. 
To facilitate the search of structurally similar fragments 
from the protein database, ppsAlign has an off-line com- 
ponent that pre-processes substructures from the entire 
protein database and builds an indexing tree to allow 
fast retrievals. 

Index-based matched fragment set search 

The purpose of this CPU-based step is to quickly find 
all possible matched fragment sets (MPS') between the 
target protein and each database protein for further 
refinement based on an information retrieval (IR) 
approach which goes beyond the capability of the tradi- 
tional "bag of words" concept by introducing spatial 
relationships among these fragments. Let 
Q = {qi,q2, • • • .^?Lq} and P= {pi,p2, pip} be a target 
protein with Lq residues and a database protein with Lp 
residues, respectively. Here, q and p represent 3D coor- 
dinates of the Ca atoms. A fragment /is a set of Lf { = 
8) continuous residues with the direction from N term- 
inal to C terminal along the protein backbone. A MPS 
includes two non-empty subsets, Fq and Fp, which con- 
tain an order of fragments that conforms to some cri- 
teria of structural similarity between Q and P, 
respectively. The fragments in a MPS will then be used 
to generate a rough alignment between Q and P in the 
fragment-level alignment. 
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Figure 1 Framework of ppsAlign. The framework consists of both GPU- and CPU- based processes. The input includes a target protein and 
database proteins. The output contains all the structural alignment results between the target protein and each database protein. 



The MFS search utilizes the substructure mapping 
method of the Index-based Substructure Alignment 
algorithm [32], developed by the authors, to retrieve 
similar fragments from the database proteins. In this 
method, substructures of the database proteins, 
extracted by a large set of pairs of windows along the 
backbones, are indexed off-line by an indexing tree in 
which similar substructures are clustered into same leaf 
node, denoted by tf , and one substructure is selected 
as representative for each leaf node. Such representative 
structures preserve certain topological information, both 
locally and globally, from two disjoint substructures 
with various ranges of distances. Similarly, substructures 
in the target protein Q are indexed by an indexing tree 

in which each leaf node is denoted by . The represen- 
tative substructure of each is used to search the 

I 



indexing tree of database and a list of best matched is 
returned. For simplicity, we use t to denote and f^. 

The database proteins that have substructures in can 
be found by an inverted index. Such a database protein, 
P, can be represented by an order of substructures, 
denoted by il^, occurring in t. Likewise, the protein Q 
can be represented by an order of substructures, 
denoted by occurring in t. As substructures identi- 
fied by the same t are similar, they can be used as 
"anchors" for rough alignments. For detailed explanation 
of the substructure mapping method, readers are 
referred to [32]. 

In ppsAlign, substructures are further projected into 
fragments as follows: if any residue of a substructure 
from (or Q^) is located in a fragment, the frag- 
ment is selected and added to Fp (or Fq). The fragment 
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subsets Fp and Fq are used to construct a MFS between 
the protein Q and P, After searching all we can 
obtain all possible MFS' between Q and database pro- 
teins, if any. In this step, if the algorithm cannot find 
any MFS for a database protein, all the fragments from 
Q and the database protein are selected to form a MFS. 
An example of MFS searching and construction is illu- 
strated in Additional file 1: Figure SI. 

After searching MFS, a filtering process is called to 
remove redundant MFS'. Then, the non-redundant MFS' 
between Q and each database protein are ranked 
according to scoring function Smfs and the top Ngeed 
sets are selected. The scoring function is defined as fol- 
lows: 



Smfs = wi 



+ W2 



Np min (Nq Np) 
— + W3 '■ 



N: 



max (Nq,Np) 



where Nq and Np denote the cardinality of Fq and Fp 
in a MFS, respectively. 

NfQ = \LQ/Lf~\ and Nfp = \Lp/Lf~\ are the numbers of 
fragments in the target protein and a database protein, 
respectively. The third term of the above scoring func- 
tion is used to favor MFS' which have comparable Nq 
and Np. The values Wj, W2> and W3 are used to weight 
the contributions from the three terms. 

The data needed by ppsAlign in order to compute the 
alignments on GPU are: structures of the protein Q and 
of the database proteins, and MFS'. To allow efficient 
processing, those data must be judiciously laid out on 
the GPU memories. Specifically, the database structures 
are transferred to the texture memory before execution. 
The MFS' are transferred from CPU memory to GPU 
global memory as inputs to the fragment-level alignment 
(see Figure 1). Finally, the structure of protein Q is 
stored in the constant memory, which has smaller capa- 
city but lower access latency compared to the texture 
memory. 

Fragment-level alignment 

In this step, the fragments in each MFS are assembled 
to obtain initial alignments using Dynamic Program- 
ming (DP). For a given MFS, the DP algorithm first 
sorts the fragments from Fq and Fp according to their 
locations in Q and P, Then, it computes the similarity 
score S/^i, j) of each fragment pair for 1 < i < Nq and 1 
< 7 < Np using the following recurrence: 



sf {hi) 



max 



Sf {hi -l) + Gf 
Sf {i - 1, j) + Gf 



where Gf is gap penalty and Sf is based on the inverse 
cosine distance of fragment's feature vector. Given a 



fragment pair, A and B, and their corresponding feature 
vectors and D^, Sf is calculated as follows: 



1 



cos 



IIDaIMID^I 



where <Da, Db > is the inner product of and D^, 1 1 
D^ll and \ \Db\ \ are the norm of and D^, respec- 
tively. In the current implementation, features only use 
Euclidean distance of each residue pair for fast calcula- 
tion. The main reason for using feature distance as an 
approximate measure of fragment similarity is the need 
for simple control paths due to the SIMT (Single 
Instruction, Multiple Thread) computing mode of the 
GPU [30]. Traditional methods usually calculate RMSD 
and find an optimal transformation using the Kabsh 
algorithm [33], which contains complex control flows 
and is therefore not suitable for the SIMT mode. This 
step provides a rough alignment result which will be 
refined by the residue-level alignment. 
GPU computation for fragment-level alignment 
The pseudo-code in Figure 2 describes the fragment- 
level alignment. The algorithm splits the computation 
into three GPU kernels. The first kernel performs the 
computation of the fragment scores Sf by assigning a 
database protein to each thread. This kernel performs 



Algorithm 1 Fragment-level alignment 



// GPU kernel for fragment similarity score calculation 
1: for /= 1 to ILg/Lf] do 
2: for /= 1 to \Lp/Lf\ do 
3 : Calculate using Equation (3) 
4: end for 
5: end for 

6: Batch number Nhafch^l^F^^Tl 
7: for all batches do 

// GPU kernel for Dynamic Programming 
8: Initialize each thread using MFS 
9: for 1 to No do 
10: for 1 toTVpdo 
1 1 : Calculate using Equation (2) 

12: Store 5"/ and directions in GPU global memory 

13: end for 
14: end for 

// GPU Kernel for back tracing 
15: Start from the largest score in last column/row 
16: Back tracing using direction matrix to obtain 

alignment paths 
17: Copy fragment alignments paths from GPU to CPU 

memory 
18: end for 



Figure 2 Algorithm of fragment-level alignment. Fragment-level 
alignment consists of three GPU kernels. The first kernel performs 
the computation of the fragment scores. The second kernel 
implements the Dynamic Programming algorithm, and the third 
one performs the back tracing. 
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all-against-all fragment comparisons and writes similar- 
ity scores into the GPU global memory. 

The second GPU kernel implements the DP algo- 
rithm, whereas the third one performs back tracing. The 
total number of threads Nt that can run concurrently 
on the GPU is mainly limited by the global memory 
capacity of the GPU (in this phase each thread requires 
approximately 10 kB of memory). Suppose that the total 
number of MFS between Q and all database proteins is 
Np. If Np >NTy the overall MFS' will be divided into 
^batch = ^Nf/Nt~\ batches. ppsAlign sequentially sche- 
dules each batch to run on GPU. In each batch, the DP 
is first executed as a GPU kernel and each thread corre- 
sponds to a MFS. Then, the GPU kernel for the back 
tracing is called to obtain alignment paths for each 
MFS. When a batch terminates, ppsAlign transfers the 
output (i.e., alignment path for each MFS) from the 
GPU memory to CPU memory. After aggregating the 
outputs from all batches, ppsAlign first performs filter- 
ing to remove redundant alignments, and then assem- 
bles all the fragments along the alignment paths to form 
residue alignments which will be further refined by the 
residue-level alignment. 

It is critically important to effectively utilize the lim- 
ited memory resources of the GPU. Our GPU memory 
allocation scheme is exemplified in Figure 3. The MFS' 
are stored in a 2D block of size {Nt x Ng) where Ng is 
the maximal size of all MFS'. Each thread of the DP ker- 
nel fetches a MFS to initialize its setting. The score and 
direction matrices are stored in a separate 3D memory 
block of size {Nq x Np x Nt)i where Nq and Np repre- 
sent the maximal number of fragments from the target 
protein and all the database proteins, respectively. The 
alignment paths are then stored in a 2D block of size 
{Np X Np)' In ppsAlign, multiple GPU memory accesses 
are coalesced into a single transaction whenever possi- 
ble. This fragment-level alignment process provides a 
selection of seed fragments which are likely to be suc- 
cessful in accurate alignment. Only approximately 1.6% 
of the total execution time is spent in this phase. 
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Figure 3 GPU global memory layout of fragment-level 
alignment. The Matched Fragment Sets (MPS') are stored in a 2D 
block. The score and direction matrices are stored in a separate 3D 
memory blocl<. The alignment paths are stored in a 2D block. 



Residue-level alignment 

The results of fragment-level alignment are then refined 
by a residue-level alignment process. Such a refined 
alignment result is an ordered set R - {{q^ p^ \ qi^Q', 
Pi^P'}, where Q' Q Q (target protein) and P' Q P (data- 
base protein). 

In this step, a rigid-body transformation (rotation and 
translation) T that minimizes the RMSD of R is first cal- 
culated. Then, the transformation T is used to superim- 
pose all the residues from Q over P, Finally, the DP 
algorithm is used to find an alignment path between Q 
and P similar to the fragment-level alignment. In the 
DP, the gap penalty is set to 0 and the residue simi- 
larity score Sy. uses the scoring function from TM-align 
[10]. However, our framework can be configured to use 
any suitable residue-level scoring function [1]. 

As we mentioned previously, the complex control 
flows present in the traditional method for computing T 
(e.g., Kabsch algorithm [33]) make it unsuitable for the 
SIMT computing model of GPU. To address this issue, 
we implement and optimize a fast algorithm using qua- 
ternion-based characteristic polynomial (QCP) [34], 
gRMSD-QCP, to determine the transformation T on 
GPU. In the gRMSD-QCP kernel, coordinates of resi- 
dues from two protein structures are first written into 
the GPU global memory and origin of coordinate is 
moved to the center of coordinates for each protein. 
Then, the inner-product of two coordinate matrices is 
calculated, which is used by QCP for RMSD calculation. 
The work flow of gRMSD-QCP is relatively simple, and 
therefore amenable of efficient GPU implementation. 
GPU computation for residue-level alignment 
The GPU implementation of residue-level alignment 
starts with loading coordinates of residues from R to the 
GPU global memory. Next, the gRMSD-QCP kernel is 
invoked to calculate the transformation T which is also 
written into the GPU global memory. Finally, a DP ker- 
nel is called to find residue alignments which are trans- 
ferred into the CPU memory after the kernel terminates. 

As in the fragment-level alignment phase, the residue- 
level alignments are divided into batches according to 
the memory requirement of the threads. After all the 
batches are executed, ppsAlign aggregates the outputs of 
residue alignment R, which are used in the next step for 
searching the maximal alignment. 

Maximal alignment search 

The maximal alignment search is used to find the lar- 
gest subset M Q R such that the score of the residue 
alignment R, denoted by Sa> is maximized. Because find- 
ing the largest subset M is extremely time-consuming, a 
heuristic and approximate algorithm, MaxSub [35], has 
been developed to solve this problem. In ppsAlign, a 
variant of MaxSub, gMaxSub, is designed to parallelize 
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the search process on the GPU. In the current imple- 
mentation of ppsAlign, Sa is defined using the TM-score 
[10]. 

GPU computation for maximal alignment search 

The input of this step is the aUgnment R from the resi- 
due-level alignment which has Lj^ aligned residue pairs. 
The original MaxSub algorithm on CPU searches the 
largest subset M by shifting a window W of size 
along R (see Figure 4a). This results into {L]^'Ly(^ + 1) 
shift operations which are candidates for parallelization. 
Then, gMaxSub searches the maximal alignment by 
concurrently dispatching each calculation of W to differ- 
ent GPU threads (see Figure 4b). 

Figure 5 describes a pseudo-code of gMaxSub. First, 
for each residue alignment R between Q and P, {L]^'Ly(^ 
+ 1) windows are generated. Second, the gRMSD-QCP 
kernel is invoked to calculate the transformation T for 
the residue pairs within each W and then T is used to 
superimpose residues from Q over P in R. Third, residue 
pair {qi, Pi)^R is added into W^if its distance is below a 
cutoff (4.0 A) after the superimposition. The above two 
steps (i.e., gRMSD-QCP and window extension) are 
iteratively executed for Nms times. Forth, the last W is 
assigned to M and Sa is calculated. 

As in previous phases, the maximal alignment 
searches are divided into batches. After all the batches 
are executed, ppsAlign aggregates the outputs of subset 
and selects the one with the largest S^. as the largest 
subset M, The transformation T associated with the lar- 
gest subset M is used to superimpose all the residues 
from Q over P and the residue pair whose distance is 
below a cutoff (4.0A) is selected to form a new residue 
alignment R. 

After gMaxSub terminates, if the current iteration 
number <Niten the residue alignment R will be first fil- 
tered to remove redundant alignments from the same 
database protein and then sent to the residue-level 
alignment for further refinement; otherwise, R will be 
used as input for the next step of final assessment. 



Residue alignment R 



~ Search in parallel 



Sliding window W 

Figure 4 Comparison of MaxSub and gMaxSub. The original 
MaxSub algoritlim on CPU searclies tine largest subset by shifting a 
window 1/1/ along the residue alignment R. The gMaxSub searches 
the maximal alignment by concurrently dispatching each calculation 
of 1/1/ to different GPU threads. 



Algorithm 2 Maximal alignment search 



Load residue alignment R to GPU global memory 
Calculate shifting window W for each R 
Nyy <— total number of windows 
Batch number /V/,,,,,/, = \NJNt\ 
5: for all batches do 

// GPU kernel for gMaxSub 
W= 0 

for / = 1 to Nms do 

Call kernel gRMSD-QCP to calculate transformation T 
Superimpose residues in R with T 
for each residue pair in R do 
if distance < 4.0 A then 

Add the residue pair to W 
end if 
end for 
end for 

Calculate score Sa 
end for 

Select W with maximal Sg as the largest subset M for R 



Figure 5 Algorithm of maximal alignment search. For each 
search, the gRMSD-QCP kernel is invoked to calculate the 
transformation superimpose residues with each window. The 
residue pair is added into the window if its distance is below a 
cutoff. 



Final assessment of alignment quality 

After structure alignments are computed on GPU, the 
residue alignments R are transferred from the GPU 
memory to CPU memory. We use PSI (percentage of 
structural similarity), defined as the percentage of resi- 
due pairs from R with distance below 4.0 A, to score the 
alignment quality. We also assess the statistical signifi- 
cance of the alignments through z-Score of the PSI, 
which is given as follows: 



z - Score 



PSI - ^xpsi 

O'PSI 



where pipsi and Gpsi denote mean and standard devia- 
tion of PSI for a given protein chain length, respectively. 
The parameters /Upsj and Opsj are obtained using a 
method similar to [12], leading to the following settings: 
fipsi = 375.64./c'^-^^^^ and Gpsi = 99.67./c'^-^^^l Here, k is 
the minimum chain length between target and database 
proteins. 

Results 

In this section, we compare ppsAlign s performance to 
concurrent methods in terms of alignment quality and 
computational efficiency. We evaluate ppsAlign using an 
NVIDIA Tesla C2050 GPU card equipped with 448 
cores at 1.15 GHz and 3 GB global memory. The con- 
current methods include TM-align [10], Fr-TM-align 
[11], and MAMMOTH [12], which share similar com- 
putational framework as ppsAlign, As DALI [8] and CE 
[9] have been exhaustively evaluated elsewhere [10], we 
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do not include these approaches in our experiments. We 
download software packages of these methods from 
their official websites and evaluate the performance on a 
Linux personal computer with AMD Opetron dual-core 
1000 series processor at 1.8 GHz and 8 GB RAM. 

The main purpose of structure alignment is to maxi- 
mize the number of aligned residues (Ne) while mini- 
mizing the RMSD of the aligned residues, denoted by 
cRMSD. To eliminate the size dependence of cRMSD 
on Ng, in this paper we use a normalized measure of 
cRMSD, RMSDiooj to evaluate the alignment quality. 
RMSDioo is calculated as follows [36]: 



i^SA^Dioo 



cRMSD 



1 + In 



100 



which corresponds to the cRMSD value expected if 
the two protein structures were 100 residues long. 

To evaluate efficiency, we measure the execution 
time on a dataset in which the protein's chain length is 
in a range from 80 to 500 residues extracted from 
ASTRAL 1.75 database [37] with sequence identity < 
40% (ASREAL40). The database protein chain length is 
determined by the global memory capacity on the 
GPU card. However, this limitation is not severe as 
98.5% ASTRAL40 protein chains have less than 500 
residues. We expect that the advancement of GPU 
technology will solve this memory limitation issue in 
the near future so that the ppsAlign algorithm can 
handle protein chains longer than 500 residues. Cur- 
rently we can handle structures larger than 500 resi- 
dues in one of the following two ways: 1) by sending 
the alignment tasks to our CPU-based algorithm and 
2) if resource allows, by using another GPU card to 
align the remaining 1.5% of large structures. Although 
the algorithm can also handle small protein chains 
below 80 residues (-16% of ASTRAL40), we do not 
use them for our testing because they have relatively 
simple topologies [38]. 

To efficiently utilize global memory of GPU card, the 
entire database proteins are sorted according to the 
chain length and then divided into two small datasets: 
1) Di, which includes 6, 569 proteins in the range [80, 
250) residues selected from ASTRAL40 according to 
the length distribution of proteins, and 2) D2, which 
includes 1, 912 proteins in the range [251, 500) resi- 
dues. The target dataset includes 100 proteins which 
are randomly selected in the range [80, 250) from 
ASTRAL40. For each target protein, a one-against-all 
alignment is performed with all database proteins and 
totally 100 X (6, 569 + 1, 912) = 848, 100 non-homolo- 
gous protein pairs are compared during the 
experiment. 



Scalability of ppsAlign 

There are two critical parameters for ppsAlign, namely 
the maximal number of iteration {Nuer) and the maximal 
number of MFS {Nseed)- Intuitively, when increasing Niter 
or Nseed> ppsAlign will often obtain better alignment 
quality but the execution time will be significantly 
lengthened. To verify this, we preliminarily investigate 
the performance of different settings using a small target 
dataset of 17 proteins and the dataset Dj in terms of 
RMSDioo- The experimental results of RMSD^oo with 
Niter = {3, 5, 7} and Nseed = {10, 30, 50, 70} are shown in 
Figure 6, which illustrates that ppsAlign has decreased 
RMSDioo when Niter and/or Nseed is increasing. This fig- 
ure can be used as a guideline for parameter selection 
of ppsAlign, For a fair comparison of efficiency improve- 
ment from ppsAlign to a concurrent method, we select 
a combination of Niter and Nseed that achieves compar- 
able alignment quality. 

Speedup over TM-align and CPU-based ppsAlign 

In this experiment, ppsAlign is executed with a para- 
meter setting of N^ter = 3 and Nseed = 20 which results 
in a comparable RMSDioo to TM-align and the CPU 
version of ppsAlign. Table 1 summarizes the alignment 
quality, average execution time, and corresponding 
speedup. ppsAlign achieves speedups of 23.8 and 35.9 
compared to CPU-based ppsAlign and TM-align, respec- 
tively. The detailed comparison of alignment quality of 
ppsAlign and TM-align can be found in Additional file 
1: Table SI. 

Speedup over Fr-TM-align 

Since Fr-TM-align performs more iterations to improve 
its alignment quality over TM-align, we increase both 
iteration and seed numbers of ppsAlign algorithm to 
achieve a comparable alignment quality with Fr-TM- 
align. The experimental results of RMSD^oo^ average 
execution time, and corresponding speedup with N^ter = 
6 and Nseed = 30 are shown in Table 2. ppsAlign 
achieves speedup 64.7 compared to Fr-TM-align with 
the same alignment quality. The detailed comparison of 
alignment quality of ppsAlign and Fr-TM-align can be 
found in Additional file 1: Table S2. 

Speedup over MAMMOTH 

In the last experiment, we use the same dataset to com- 
pare the performance of ppsAlign and MAMMOTH. 
Different from TM-align and Fr-TM-align, MAM- 
MOTH is originally developed for the purpose of large- 
scale comparisons with high efficiency at the cost of the 
reduction of alignment quality. Because of its high 
speed, MAMMOTH is used as a benchmark for maxi- 
mal speed on the CPU platform in [39]. The experimen- 
tal results of RMSD^oo^ average execution time, and 
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Nseed=10 Nseed=30 Nseed=50 Nseed=70 



B 

5.75 




Nseed=10 Nseed=30 Nseed=50 Nseed=70 



C 

5.75 




Nseed=10 Nseed=30 Nseed=50 Nseed=70 



Figure 6 Performance comparison of ppsAlign with different settings of Nseed and Nuer ppsAlign is running on NVIDIA Tesia C2050 GPU 
card with a small target dataset of 17 proteins. The parameter settings of ppsAlign are Niter = {3, 5, 7} and Nseed = {10, 30, 50, 70}. (A) N^ter = 3. (B) 

N,er = 5. (C) N,er = 7. 



corresponding speedup with Niter = 1 and Nseed = 8 are 
shown in Table 3. ppsAlign achieves speedup 40.3 com- 
pared to MAMMOTH and higher alignment quality. 
The detailed comparison of alignment quality of ppsA- 
lign and MAMMOTH can be found in Additional file 1: 
Table S3. 



Discussion 

The framework of ppsAlign is a general-purpose GPU 
platform for protein structure alignment which could 
take many concurrent methods, such as TM-align [10] 
and Fr-TM-align [11], into the parallelized algorithm 
design. An important novelty in our approach is to 
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Table 1 Average execution time of TM-align, CPU-based ppsAlign, and ppsAlign with parameter settings (Nuer = 3 and 

Nseed = 20). 



Dataset 


Methods 


RMSDinn 


Execution time (s) 


Speedup of ppsAlign 


Di 


ppsAlign 


5.7 


64 






CPU-based ppsAlign 


5.7 


1596 


24.9 




TM-Align 


5.7 


2170 


33.9 


D2 


ppsAlign 


5.3 


41 






CPU-based ppsAlign 


5.3 


899 


21.9 




TM-Align 


5.3 


1597 


39.0 


Total 


ppsAlign 




105 






CPU-based ppsAlign 




2495 


23.8 




TM-Align 




3767 


35.9 



RMSDioo is tlie expected value of cRMSD if tlie two protein structures were 100 residues long. ppsAlign is running on NVIDIA Tesia C2050 GPU card and other 
methods (CPU-based ppsAlign and TM-align) are running on a computer with AMD Opetron dual-core 1000 series processor at 1.8 GHz and 8 GB RAM. The 
parameter settings of ppsAlign are Niter - 3 and Nseed - 20 



create a unique design to manage resources of the GPU 
architecture. First, an intelUgent decomposition of the 
application in kernels characterized by different paralle- 
lization strategies is provided. In the existing methods 
for GPU-based sequence alignment mentioned pre- 
viously, a pair-wise comparison is either assigned to a 
thread (i.e., inter-task parallelization) or corporately per- 
formed by a block of threads (i.e., intra-task paralleliza- 
tion) [18,20]. However, as the workflow of structure 
alignment is more complicated than that of sequence 
alignment, neither the inter- nor the intra- task paralleli- 
zation can efficiently exploit the GPU computing power. 
Therefore, ppsAlign utilizes a hybrid inter- and intra- 
task parallel model. In particular, each task (i.e., pair- 
wise structural comparison) is divided into several inde- 
pendent seed alignments. Each seed alignment is 
assigned to a different thread (inter-task parallelization), 
whereas each block executes one or more pair-wise 



Table 2 Average execution time of Fr-TM-align and 
ppsAlign with parameter settings (Niter = 6 and Nseed = 
30). 



Dataset 


Methods 


RMSDioo 


Execution time 

(s) 


Speedup of 
ppsAlign 




ppsAlign 


5.4 


326 






Fr-TM- 
align 


5.4 


19849 


60.9 


D2 


ppsAlign 


5.1 


224 






Fr-TM- 
align 


5.1 


15729 


70.2 


Total 


ppsAlign 




550 






Fr-TM- 
align 




35578 


64.7 



RMSDioo is the expected value of cRMSD if the two protein structures were 
100 residues long. ppsAlign is running on NVIDIA TesIa C2050 GPU card and 
Fr-TM-align is running on a computer with AMD Opetron dual-core 1000 
series processor at 1.8 GHz and 8 GB RAM. The parameter settings of ppsAlign 
are Njter = 6 and Nseed - 30 



comparisons (intra-task parallelization). Second, a smart 
design of memory layout and memory access patterns 
are developed, the former allowing an effective use of 
the memory capacity at the different levels of the GPU 
memory hierarchy, and the latter minimizing the mem- 
ory bandwidth requirement of the application. Third, 
several efficient algorithms for avoiding complex control 
flow on GPU are proposed to take advantage of the 
STMT nature of the GPU. For instance, a feature-based 
measure is used to compute similarity of fragment at 
the fragment-level alignment which can avoid time-con- 
suming RMSD calculation at the initial stage of struc- 
ture alignment. 

One of the major ways in which ppsAlign differs to 
other methods is implementing protein structure align- 
ment at the residue level on GPU. Recently, the GPU- 
enhanced algorithms are gaining an increasing attention 
in bioinformatics. One of the major steps was a GPU 
implementation of a one-against-all sequence compari- 
son using Smith-Waterman algorithm [20,21]. With 
these methods, a sequence database search can be per- 
formed resulting in a list of similarity scores, while these 

Table 3 Average execution time of MAMMOTI-I and 



ppsAlign with parameter settings (Nuer = 1 anci Nseed = 8)* 



Dataset 


Methods 


RMSDioo 


Execution time 

(s) 


Speedup of 
ppsAlign 




ppsAlign 


6.3 


10 






MAMMOTH 


10.3 


470 


47.0 


D2 


ppsAlign 


5.9 


8 






MAMMOTH 


9.2 


255 


31.9 


Total 


ppsAlign 




18 






MAMMOTH 




725 


40.3 



RMSDioo is the expected value of cRMSD if the two protein structures were 
100 residues long. ppsAlign is running on NVIDIA TesIa C2050 GPU card and 
MAMMOTH is running on a computer with AMD Opetron dual-core 1000 
series processor at 1.8 GHz and 8 GB RAM. The parameter settings of ppsAlign 
are Njter = 1 and Nseed = 8 
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methods do not provide the detailed aUgnment informa- 
tion of the best hits [23]. To provide detailed residue- 
residue correspondence, GPU-BLAST [24] was devel- 
oped, that allowed to accelerate the NCBI-BLAST 
search, achieving the speedup between 3 and 4 on an 
NVIDIA Tesla C2050 GPU card. In addition, another 
approach to protein sequence that uses backtracking on 
GPU to construct alignment of residues has been pro- 
posed [23]. Compared to the sequence alignments, the 
implementation of structure alignment on GPU is a 
more challenging task, because some routines (e.g., 
RMSD calculation) can cause severe divergence among 
GPU threads and decrease performance of GPU. One of 
the first structure comparison methods implemented on 
GPU, SA Tableau Search [27], aligns protein substruc- 
ture at the secondary structure level, that is by aligning 
secondary structure elements, while not aligning struc- 
tures at the residue level. To the best of our knowledge, 
ppsAlign is the first protein structure comparison plat- 
form for GPU that provides the residue level structural 
alignment. 

The substantial contribution of ppsAlign is to pro- 
vide a high-performance computing platform for the 
research community. An alternative solution to accel- 
erate the protein structure alignment is to install more 
CPU computing cores in a single machine. However, 
using more CPU cores in a single machine need to 
upgrade main board and memory accordingly, which 
could decrease price/performance ratio. In contrast, 
installing a GPU card into a PCIe (Peripheral Compo- 
nent Interconnect Express) slot does not require extra 
cost and more GPU cards can be installed into one 
PCIe slot by a switch. In this paper, an NVIDIA Tesla 
C050 GPU card is utilized to evaluate performance, 
which has also been used in GPU-BLAST [24]. Though 
it is a high end product of NVIDIA, we expect its price 
will drop in the near future due to market demand in 
gaming industry. 

Conclusions 

This paper presents ppsAlign for large-scale protein 
structure alignment using GPUs. ppsAlign employs an 
index-based search procedure to find seeds of matched 
fragment sets, and then iteratively refines the seeds with 
fragment- and residue- level alignments. We provide an 
in-depth comparison of ppsAlign against several concur- 
rent CPU-based methods. Our experimental results 
show that ppsAlign can achieve significant speedup over 
its CPU implementation, TM-align, Fr-TM-align, and 
MAMMOTH on a single NVIDIA Tesla C2050 GPU. 

We emphasize that the framework of ppsAlign is not 
designed as a replacement for the existing structural 
alignment tools, but as a general-purpose platform for 
protein structure alignments on GPU. With this 



platform, we can parallelize the existing algorithms (e.g., 
TM-align and Fr-TM-align) on GPU and utilize the 
massive parallel computing power of GPU to achieve 
high-throughput structural comparisons without sacrifi- 
cing alignment quality. 

AvailabMity and requirements 

♦ Project name: ppsAlign 

♦ Project home page: http://proteindbs.rnet.missouri. 
edu/ppsalign/ppsalign.html 

♦ Operating system(s): Linux 

♦ Programming language: CUD A, JAVA, and PHP 

♦ License: none 

Availability of supporting data 

The data sets supporting the results of this article are 
available in the Worldwide Protein Data Bank reposi- 
tory, http://www.wwpdb.org/. 

Additional material 



Additional file 1: Figure SI. In this example, one leaf node tV from 
the indexing tree of the target protein Q is used to search the indexing 
tree of entire protein database A and m best matched nodes are 

returned. In this example, node is represented by a representative 

which is a "structure medium" from three similar substructures {Uj^ /, 

^j, 2, Uj, 3} from Q. A search of cj^ on the indexing tree of ^ returns two 

database leaf nodes, tf and • tf node, represented by cf , has 
two groups of similar substructures {c/,; /, c/,; i 2} and {c/,; 2, i, di, 2, 2, ^i, 2, 
3} which are from database proteins P/ and P2/ respectively, node, 

represented by , has two groups of similar substructures [6^^ /, 6\^^ 
2} and {d\^^ 3^ c/^^^ 3^ 2, 3, 3} from database proteins and P3, 
respectively. The RMSD of \yC^ , cf | and \yC^ , cf | is below a cutoff 
(4.5A). After substructure searching, the target protein Q can be 
represented by = {uj^ /, Uj^ 2, ^j, 3}- The database proteins P/, P2, and 
P3 can be represented by Q^^ = {c/,; d,^ ^ 2, dk, i, v d^i 2I Q^^ = id,, 
2, h di, 2, 2, di, 2, 3}, and = {^/^^ ^ ^j^^ ^ d^, 3, 3I respectively. After 

projecting the substructures to fragments, we have three MFS' for node /' 
of the indexing tree of Q for P?, P2, and P3. Table SI. Comparison of 
alignment quality (RMSDtoo) of ppsAlign and TM-align. The table 
compares the alignment quality measured in RMSDioo of the 100 target 
proteins using ppsAlign and TM-align. Table S2. Comparison of alignment 
quality (RMSDioo) of ppsAlign and Fr-TM-align. The table compares the 
alignment quality measured in RMSDioo of the 100 target proteins using 
ppsAlign and Fr-TM-align. Table S3. Comparison of alignment quality 
(RMSDioo) of ppsAlign and MAMMOTH.The table compares the alignment 
quality measured in RMSDioo of the 100 target proteins using ppsAlign 
and MAMMOTH. 
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